Multi-Task Learning for Email Search Ranking with Auxiliary Query Clustering

Author: Bendersky Michael
Karimzadehgan Maryam
Metzler Donald
Qin Zhen
Shen Jiaming
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/09/2018

User information needs vary significantly across different tasks, and therefore their queries will also differ considerably in their expressiveness and semantics. Many approaches have been proposed to model such query diversity by obtaining query types and building query-dependent ranking models. These approaches typically require either a labeled query dataset or clicks from multiple users aggregated over the same document. They are therefore not applicable when manual query labeling is not viable and aggregated clicks are unavailable due to the private nature of the document collection, e.g., in email search scenarios. In this paper, we study how to obtain query types in an unsupervised fashion and how to incorporate this information into query-dependent ranking models. We first develop a hierarchical clustering algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine query types. Then, we study three query-dependent ranking models, including two neural models that leverage query type information as additional features, and one novel multi-task neural model that views query type as the label for the auxiliary query cluster prediction task. This multi-task model is trained to simultaneously rank documents and predict query types. Our experiments on tens of millions of real-world email search queries demonstrate that the proposed multi-task model can significantly outperform the baseline neural ranking models, which either do not incorporate query type information or simply feed query type as an additional feature.

Comment: CIKM 201
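
The clustering step named in the abstract (truncated SVD followed by varimax rotation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `varimax` and `cluster_queries`, the toy query-term matrix, and the rule of assigning each query to its largest-magnitude rotated component are all assumptions made for illustration, and only a single, non-hierarchical level is shown.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix toward simple structure (varimax)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like term of the varimax criterion.
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_objective = s.sum()
        # Stop once the criterion no longer improves noticeably.
        if objective != 0 and new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return loadings @ rotation

def cluster_queries(query_term_matrix, n_types):
    """Assign each query (row) to one of n_types clusters (hypothetical helper)."""
    # Truncated SVD: keep only the top n_types components.
    u, s, _ = np.linalg.svd(query_term_matrix, full_matrices=False)
    embedding = u[:, :n_types] * s[:n_types]
    rotated = varimax(embedding)
    # Assumed assignment rule: the component with the largest absolute loading.
    return np.abs(rotated).argmax(axis=1)

# Toy data: three queries over terms 0-1, two queries over terms 2-4.
toy = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [2, 2, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 2, 1],
], dtype=float)
print(cluster_queries(toy, n_types=2))
```

A coarse-to-fine hierarchy could plausibly be obtained by applying the same step again within each cluster, but the abstract does not specify how the levels are built.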

arXiv.org e-Print Archive

Crossref

It's All Relative! -- A Synthetic Query Generation Approach for Improving Zero-Shot Relevance Prediction

Author: Bendersky Michael
Chaudhary Aditi
Raman Karthik
Publication date: 14/11/2023

Recent developments in large language models (LLMs) have shown promise in their ability to generate synthetic query-document pairs by prompting with as few as 8 demonstrations. This has enabled building better IR models, especially for tasks with no training data readily available. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. a text document) and generate a query relevant to that context, or condition the QGen model additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as they require the model to reason about the desired label and the input from a handful of examples. In this work, we propose to reduce this burden on LLMs by generating queries simultaneously for different labels. We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query relative to a relevant query is a much simpler task setup for the model to reason about. Extensive experimentation across seven IR datasets shows that synthetic queries generated in such a fashion translate to better downstream performance, suggesting that the generated queries are indeed of higher quality.

Comment: 18 pages

arXiv.org e-Print Archive

LambdaLoss: Metric-Driven Loss for Learning-to-Rank

Author: Bendersky Michael
Golbandi Nadav
Li Cheng
Najork Marc
Wang Xuanhui
Publication venue: Technical Disclosure Commons
Publication date: 31/05/2018

How to directly optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an interesting but challenging problem, because ranking metrics are either flat or discontinuous everywhere. Among existing approaches, LambdaRank is a novel algorithm that incorporates metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, what is the underlying loss that LambdaRank optimizes for? Due to this, it is unclear whether LambdaRank will always converge. In this paper, we present a well-defined loss for LambdaRank in a probabilistic framework and show that LambdaRank is a special configuration in our framework. This framework, which we call LambdaLoss, provides theoretical justification for LambdaRank. Furthermore, we propose a few more metric-driven loss functions in our LambdaLoss framework. Our loss functions have clear connection to ranking metrics and can be optimized in our framework efficiently. Experiments on three publicly available data sets show that our methods significantly outperform the state-of-the-art learning-to-rank algorithms. This confirms both the theoretical soundness and the practical effectiveness of the LambdaLoss framework.
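
The claim that ranking metrics are "either flat or discontinuous everywhere" can be seen in a small NDCG computation. The sketch below is illustrative only and is not taken from the paper: the helper names `dcg` and `ndcg`, the exponential gain `2^rel - 1`, the `log2` discount, and the toy scores are assumptions (this is one common NDCG formulation).

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(scores, labels):
    """NDCG of the ranking induced by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [2, 0, 1]  # graded relevance of three documents (toy data)

base = ndcg([0.9, 0.5, 0.40], labels)     # document 2 ranked last
nudged = ndcg([0.9, 0.5, 0.45], labels)   # score moves, order does not
flipped = ndcg([0.9, 0.5, 0.51], labels)  # document 2 overtakes document 1

print(base, nudged, flipped)
```

Moving a score without changing the order leaves NDCG unchanged (zero gradient), while a tiny further change that swaps two documents makes it jump. This is why gradient-based training relies on surrogate losses rather than the metric itself.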

Technical Disclosure Commons

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Author: Bendersky Michael
Chen Jiecao
Najork Marc
Raman Karthik
Srinivasan Krishna
Publication date: 03/03/2021

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset (https://github.com/google-research-datasets/wit) to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by the number of image-text examples by 3x (at the time of writing). Second, WIT is massively multilingual (first of its kind) with coverage over 100+ languages (each of which has at least 12K examples) and provides cross-lingual texts for many images. Third, WIT represents a more diverse set of concepts and real-world entities relative to what previous datasets cover. Lastly, WIT provides a very challenging real-world test set, as we empirically illustrate using an image-text retrieval task as an example.

arXiv.org e-Print Archive